Abstract. We present an active learning algorithm for inferring extended
finite state machines (EFSMs), combining data flow and control
behavior. Key to our learning technique is a novel learning model based 
on so-called tree queries. The learning algorithm uses the tree queries to 
infer symbolic data constraints on parameters, e.g., sequence numbers, 
time stamps, identifiers, or even simple arithmetic. We describe sufficient 
conditions for the properties that the symbolic constraints provided by a 
tree query in general must have to be usable in our learning model. We 
have evaluated our algorithm in a black-box scenario, where tree queries 
are realized through (black-box) testing. Our case studies include con- 
nection establishment in TCP and a priority queue from the Java Class 
Library. 


1 Introduction 

Behavioral models of components and interfaces are the basis for many powerful 
software development and verification techniques, such as model checking,
model-based test generation, controller synthesis, and service composition. Ideally, such
models should be part of documentation (e.g., of a component library), but 
in practice they are often nonexistent or outdated. To address this problem, 
techniques for automatically generating models of component behavior are being 
developed. These techniques can be based on static analysis, dynamic analysis, 
or a combination of both approaches. Static analysis of a component requires 
access to its source code; so when source code is not available, or when models 
must be generated on the fly, dynamic analysis is a better alternative. 

In dynamic analysis, test executions are used to drive and observe compo- 
nent behavior. Mature techniques for generating finite-state models, describing 
the possible orderings of interactions between a component and its environment, 
have been developed to support, e.g., interface modeling [4], test generation [27], 
and security analysis [23]. However, faithful models should capture not only the 
ordering between interactions (control flow aspects), but also the constraints
on any data parameters passed with these interactions (data flow aspects). Data
flow aspects are commonly captured by extending finite state machines with vari- 
ables. Together with the data parameters passed with interactions, the variables 
influence the control flow by means of guards, and the control flow can cause 
updates of variables. Different dialects of extended finite state machines (EF- 
SMs) are successfully used in tools for model-based testing [18], software model 
checking [19], and model-based development [11]. However, dynamic analysis 
techniques that generate EFSM models with guards and assignments to vari- 
ables are still lacking: existing techniques either handle only a limited range of 
operations on data (typically only equality [16,15]), require significant manual 
effort [2], or rely on access to source code. 

In this paper, we present a black-box technique for generating register au- 
tomata (RAs), which are a particular form of EFSMs in which transitions are 
equipped with guards and assignments to variables (called registers). Our
contribution is an active automata learning algorithm for RAs, which is parameterized
on a particular theory, i.e., a set of operations and tests on the data domain that
can be used in guards. By an appropriate choice of theory, we can infer RA 
models where data parameters and variables represent sequence numbers, time 
stamps, numbers with limited arithmetic, identifiers, etc. 

Our algorithm has been evaluated in a black-box scenario, using SMT-based 
test generation for realizing tree queries for integers with addition (+), equal- 
ities (=), and inequalities (<,>). We have learned models of the connection 
establishment in TCP and the priority queue from the Java Class Library. 

Illustrating example. We give an example of an RA that can be generated 
using our technique. We begin by describing the language that it recognizes. 
Consider a simplistic sliding window protocol without retransmission, with a 
window of size two, in which the receipt of messages must be acknowledged in 
order. The protocol is described as a data language $\mathcal{L}_{seq}$ over messages of form
msg(d) and ack(d), where d ranges over natural numbers. A sequence of messages
$\sigma = msg(d_1) \ldots ack(d_m)$ is in the language $\mathcal{L}_{seq}$ if (i) $\sigma$ has equally many
msg and ack messages, (ii) the data parameter d in each msg(d)-message is
one more than the data parameter of the previous msg-message, (iii) the
data parameter d in each ack(d)-message is one more than the data parameter
of the previous ack-message, and (iv) whenever msg(d) immediately precedes
ack(d'), then $d - 1 \le d' \le d$. The sequences msg(1)ack(1)msg(2)ack(2) and
msg(1)msg(2)ack(1)ack(2) are examples of data words in $\mathcal{L}_{seq}$.
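
As an illustration, the four conditions can be transcribed into a small membership test. The following Python sketch is not taken from the paper: the function name `in_l_seq` and the encoding of data symbols as (action, value) pairs are assumptions, condition (iv) is read as d - 1 <= d' <= d (the reading under which both example words are members), and the word shape "starts with msg, ends with ack" is inferred from the form msg(d_1) ... ack(d_m). It transcribes the conditions literally and is not derived from the automaton in Fig. 1.

```python
from typing import List, Tuple

Symbol = Tuple[str, int]  # (action, data value), e.g. ("msg", 1)


def in_l_seq(word: List[Symbol]) -> bool:
    """Check conditions (i)-(iv) literally, as stated in the text."""
    # Assumed shape: a word starts with a msg and ends with an ack.
    if not word or word[0][0] != "msg" or word[-1][0] != "ack":
        return False
    msgs = [d for a, d in word if a == "msg"]
    acks = [d for a, d in word if a == "ack"]
    if len(msgs) + len(acks) != len(word):
        return False  # some symbol is neither msg nor ack
    # (i) equally many msg and ack messages
    if len(msgs) != len(acks):
        return False
    # (ii), (iii) consecutive numbering within each message type
    for values in (msgs, acks):
        if any(nxt != cur + 1 for cur, nxt in zip(values, values[1:])):
            return False
    # (iv) msg(d) immediately followed by ack(d') requires d - 1 <= d' <= d
    for (a1, d), (a2, d2) in zip(word, word[1:]):
        if a1 == "msg" and a2 == "ack" and not (d - 1 <= d2 <= d):
            return False
    return True
```

For instance, `in_l_seq([("msg", 1), ("msg", 2), ("ack", 1), ("ack", 2)])` holds, whereas the word msg(1)ack(3) violates condition (iv).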

Fig. 1 shows a register automaton that accepts $\mathcal{L}_{seq}$. Locations are annotated
with registers. Accepting locations are denoted by double circles; $l_0$ is the initial
location. Transitions are denoted by arrows and labeled with a message, a guard
over parameters of the message and registers of the automaton, and an assignment
to these registers. A sink location and its adjacent transitions are omitted
in the figure. The automaton processes sequences $\sigma$ by first moving from $l_0$ to
$l_1$ and storing the data value of the initial msg in $x_1$. It then moves between
locations $l_1$ (waiting for an ack), $l_2$ (waiting for two acks), and $l_3$ (accepting).
$\mathcal{L}_{seq}$ is used as a running example throughout the paper.

Fig. 1: A simple sliding window protocol with sequence numbers.

Main ideas. In classic active learning for finite automata (e.g., L* [5]), each 
location of an inferred automaton is identified by a word that reaches it from the 
initial location. Two words lead to the same location if they behave the same 
when prepended to the same suffix (i.e., both are accepted or both rejected).
Similarly, each location in the RAs we infer is identified by a data word. To 
determine whether two data words represent the same location, it is, however, 
not sufficient to check whether they behave the same when prepended to the 
same suffix, since we want to model relations between data parameters and not 
concrete data values. For example, when learning $\mathcal{L}_{seq}$, we might wrongly deduce
that msg(3) and msg(1) represent different locations, by observing that
msg(3)ack(3) $\in \mathcal{L}_{seq}$ but msg(1)ack(3) $\notin \mathcal{L}_{seq}$. To remedy this, we have
generalized the L* algorithm to the symbolic setting.

We describe our learning framework as a game between a learner and a 
teacher: the learner has to infer an automaton model of an unknown target 
language by making queries to a teacher who knows it. The concept of a teacher 
is an abstraction that helps us separate different concerns; the concrete learning 
framework is defined by the types of queries that the teacher can answer, and 
the class of languages that can be learned. 

Teacher. In our framework, the Teacher answers equivalence queries and tree 
queries. The answer to an equivalence query tells us if a conjectured automaton 
is correct, i.e., it accepts the unknown language. If not, the teacher provides a 
counterexample, i.e., a data word that is in the language but not accepted by 
the conjectured automaton, or vice versa. In practice, counterexamples can be 
provided by, e.g., conformance testing or monitoring. 

A tree query consists of a concrete prefix (e.g., a sequence of messages where 
data parameters are instantiated with concrete data values) and a symbolic suf- 
fix. Symbolic suffixes are obtained from concrete suffixes by replacing data values 
by symbolic parameters (e.g., ack(p)). The answer to a tree query is a symbolic 
decision tree (SDT), which describes which instantiations of the symbolic suffix 
are accepted and which are rejected. Fig. 2 shows examples of SDTs for $\mathcal{L}_{seq}$.
We depict trees with the root location at the top and annotate locations with
registers. A register in the root location with index i holds the i-th data value
of the corresponding prefix. The trees describe the fragments of $\mathcal{L}_{seq}$ for suffixes
of form ack(p) after prefixes msg(1) (Tree [a]) and msg(1)ack(1)msg(2) (Tree
[b]). They each have a register at the root location and two guarded initial
transitions. In both trees, ack(p) leads to an accepting location only when the value
of the parameter p is equal to the value of the register in the root location (i.e.,
the value of the parameter from the most recent msg(p)).

Fig. 2: Isomorphic SDTs for ack(p) after [a] msg(1), and [b] msg(1)ack(1)msg(2).
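
To make the idea of answering a tree query by testing more concrete, the following Python sketch builds a one-level decision tree for a single-symbol suffix under a theory with equality only. Everything in it is an assumption made for illustration: the function `one_step_tree`, the membership test passed as `member`, the textual guard labels, and the toy language `echo` (every ack must repeat the value of the msg just before it), which stands in for the target system. It is not the tree oracle used in the paper, and it does not handle successor or inequality constraints.

```python
from typing import Callable, Dict, List, Tuple

Symbol = Tuple[str, int]
Oracle = Callable[[List[Symbol]], bool]  # hypothetical membership test


def one_step_tree(member: Oracle, prefix: List[Symbol], action: str) -> Dict[str, bool]:
    """One-level tree for the suffix action(p), for equality guards only."""
    values = [d for _, d in prefix]
    fresh = max(values, default=0) + 1  # a value different from all prefix values
    other = member(prefix + [(action, fresh)])
    tree: Dict[str, bool] = {}
    for i, d in enumerate(values, start=1):
        outcome = member(prefix + [(action, d)])
        if outcome != other:
            # Equality with the i-th prefix value changes the verdict: keep a guard.
            tree[f"p == x{i}"] = outcome
    tree["otherwise"] = other
    return tree


def echo(word: List[Symbol]) -> bool:
    """Toy target: msg(d) ack(d) msg(d') ack(d') ..."""
    if len(word) % 2:
        return False
    return all(
        word[i][0] == "msg" and word[i + 1][0] == "ack" and word[i][1] == word[i + 1][1]
        for i in range(0, len(word), 2)
    )


tree_a = one_step_tree(echo, [("msg", 1)], "ack")
tree_b = one_step_tree(echo, [("msg", 1), ("ack", 1), ("msg", 2)], "ack")
```

Here `tree_a` has an accepting branch guarded by `p == x1` and `tree_b` one guarded by `p == x3`; both reject otherwise, so the two trees coincide up to renaming the register, which mirrors the situation in Fig. 2.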

Learner. The learner infers a register automaton that accepts the unknown 
target language by making tree queries and equivalence queries. At a very abstract
level, our learning algorithm builds a prefix-closed set of prefixes, i.e., test
sequences with concrete data values that reach control locations of the inferred 
register automaton. To determine when prefixes should lead to the same control 
location in the automaton, the learner compares SDTs to each other. Prefixes 
with equivalent SDTs (isomorphic up to renaming of registers and locations) can 
be unified. The transitions of SDTs will be used to create registers, guards, and 
assignments in the automaton. For example, the trees in Fig. 2 are equivalent,
meaning that the corresponding prefixes msg(1) and msg(1)ack(1)msg(2)
should lead to the same location.

The learner submits the hypothesis automaton to an equivalence query. If the 
equivalence query is successful, the algorithm terminates; otherwise, a counterex- 
ample is returned. Counterexamples guide the algorithm to make tree queries 
for larger fragments of the target language, e.g., for more and/or longer suffixes 
after a given prefix. The resulting SDTs will lead to refinements in the hypoth- 
esis: previously unified prefixes may be split, new registers may be introduced, 
and transitions may be refined or new ones introduced. 

Related work. The problem of generating models from implementations has 
been addressed in a number of different ways. Proposed approaches range from 
mining source code [4], static analysis [25] and predicate abstraction [3,24] to 
dynamic analysis [12, 6, 28, 22]. Closest to our work are approaches that combine 
an automata learning algorithm with a method for inferring constraints on data. 
An early black-box approach to inferring EFSM-like models is [20], where models
are generated from execution traces by combining passive automata learning 
with the Daikon tool [10]. 

A number of approaches combine active automata learning with different 
methods for inferring constraints on data parameters. All these approaches follow 
a pattern similar to CEGAR (counterexample guided abstraction refinement). 
A sequence of models is refined in a process that is usually monotonic and
converges to a fixpoint. Active automata learning has been combined with symbolic
execution [13, 8] and an approach based on support vector machines [29] for
inferring constraints on data parameters in white-box scenarios. In white-box
learning scenarios (as in other static analyses) registers or state variables do not
have to be inferred as they are readily available. Sometimes abstraction is used 
to reduce the size of constructed models. In contrast, our approach will infer 
models with a minimal set of required registers. 

Previous works based on active automata learning that infer data constraints 
from tests in a black-box scenario have been restricted to the case where the only 
operation on data is comparison for equality [16,1,7]. Other approaches infer 
models without symbolic data constraints [17, 23] or require manually provided 
abstractions on the data domain [2]. In general, black-box methods can infer 
complex (e.g., arithmetic) constraints only at a very high cost, if at all. Our
black-box implementation is subject to these principal limitations, too. 

While existing approaches extend active learning to a fixed class of behavioral
models, we present a general purpose automata learning algorithm that can be 
combined with any method for generating data constraints (meeting the require- 
ments we discuss in this paper). 

Register automata are similar to the symbolic transducers of [26]. It is an 
open question if some of the decidability results for symbolic transducers can be 
adapted to RAs to help answer for which relations and operations tree queries 
and equivalence queries are decidable. 

Outline. In Sec. 2, we introduce register automata and data languages. In Sec. 3, 
we define symbolic decision trees and discuss how a tree oracle answers tree 
queries. We present the details of the learning algorithm in Sec. 4, and Sec. 5 
presents the results of applying it in a small series of experiments. Here, we 
also briefly describe the implementation of a teacher for our learning framework. 
Conclusions are in Sec. 6. 

2 Preliminaries 

In this section, we introduce the central concepts of our framework: theories, 
data languages, and register automata. 

Theories. Our framework is parameterized by a theory, which consists of an
unbounded domain $\mathcal{D}$ of data values and a set $\mathcal{R}$ of relations on $\mathcal{D}$. The
relations in $\mathcal{R}$ can have arbitrary arity. Known constants can be represented by
unary relations. For example, the theory of natural numbers with inequality is
the theory $(\mathbb{N}, \{<\})$, where $\mathbb{N}$ is the set of natural numbers and $<$ is the inequality
relation on $\mathbb{N}$. In the following, we assume that some theory has been fixed.

Data languages. We assume a set $\Sigma$ of actions, each with an arity that
determines how many parameters it takes from the domain $\mathcal{D}$. In this paper, we
assume that all actions have arity 1; it is straightforward to extend our results
to the case where actions have arbitrary arity. A data symbol is a term of form
$\alpha(d)$, where $\alpha$ is an action and $d \in \mathcal{D}$ is a data value. A data word is a sequence
of data symbols. For a data word $w = \alpha_1(d_1) \ldots \alpha_n(d_n)$, let $Acts(w)$ denote its
sequence of actions $\alpha_1 \ldots \alpha_n$, and $Vals(w)$ its sequence of data values $d_1 \ldots d_n$.
The concatenation of two data words $w$ and $w'$ is denoted $ww'$. Two data words
$w = \alpha_1(d_1) \ldots \alpha_n(d_n)$ and $w' = \alpha_1(d'_1) \ldots \alpha_n(d'_n)$ are $\mathcal{R}$-indistinguishable,
denoted $w \approx_{\mathcal{R}} w'$, if $Acts(w) = Acts(w')$ and
$R(d_{i_1}, \ldots, d_{i_j}) \Leftrightarrow R(d'_{i_1}, \ldots, d'_{i_j})$
whenever $R \in \mathcal{R}$ and $i_1, \ldots, i_j$ are indices between 1 and $n$. Intuitively, $w$ and
$w'$ are $\mathcal{R}$-indistinguishable if they have the same sequences of actions and cannot
be distinguished by the relations in $\mathcal{R}$.
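
The definition can be checked mechanically for small words. The Python sketch below is an illustration under assumptions: data symbols are encoded as (action, value) pairs, a theory is represented as a dictionary from relation names to (arity, predicate) pairs, and the names `indistinguishable` and `LESS_THAN` are hypothetical. It enumerates all index tuples, so it is only meant for short words.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Symbol = Tuple[str, int]
Relations = Dict[str, Tuple[int, Callable[..., bool]]]  # name -> (arity, predicate)


def indistinguishable(w1: List[Symbol], w2: List[Symbol], relations: Relations) -> bool:
    """True if both words have the same actions and agree on every relation instance."""
    if [a for a, _ in w1] != [a for a, _ in w2]:
        return False
    v1 = [d for _, d in w1]
    v2 = [d for _, d in w2]
    for arity, holds in relations.values():
        # Compare the relation on every tuple of positions (the indices i_1, ..., i_j).
        for idx in product(range(len(v1)), repeat=arity):
            if holds(*(v1[i] for i in idx)) != holds(*(v2[i] for i in idx)):
                return False
    return True


# The theory of natural numbers with inequality, with one binary relation.
LESS_THAN: Relations = {"<": (2, lambda a, b: a < b)}
```

Under `LESS_THAN`, the words msg(1)ack(5) and msg(2)ack(9) are indistinguishable, while msg(1)ack(5) and msg(3)ack(3) are not, since 1 < 5 holds but 3 < 3 does not.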

A data language $\mathcal{L}$ is a set of data words that respects $\mathcal{R}$ in the sense that
$w \approx_{\mathcal{R}} w'$ implies $w \in \mathcal{L} \Leftrightarrow w' \in \mathcal{L}$. A data language can be represented as a
mapping from the set of data words to $\{+, -\}$, where $+$ stands for accept and
$-$ for reject.

Register automata. Assume a set of registers (or variables), ranged over by
$x_1, x_2, \ldots$. A parameterized symbol is a term of form $\alpha(p)$, where $\alpha$ is an action
and $p$ a formal parameter. A guard is a conjunction of negated and unnegated
relations (from $\mathcal{R}$) over the parameter $p$ and registers. An assignment is a simple
parallel update of registers with values from registers or $p$.

Definition 1. A register automaton (RA) is a tuple $\mathcal{A} = (L, l_0, \mathcal{X}, \Gamma, \lambda)$, where

- $L$ is a finite set of locations, with $l_0 \in L$ as the initial location,
- $\lambda$ maps each $l \in L$ to $\{+, -\}$,
- $\mathcal{X}$ maps each location $l \in L$ to a finite set $\mathcal{X}(l)$ of registers, and
- $\Gamma$ is a finite set of transitions, each of form $(l, \alpha(p), g, \pi, l')$, where
  • $l \in L$ is a source location,
  • $l' \in L$ is a target location,
  • $\alpha(p)$ is a parameterized symbol,
  • $g$ is a guard over $p$ and $\mathcal{X}(l)$, and
  • $\pi$ (the assignment) is a mapping from $\mathcal{X}(l')$ to $\mathcal{X}(l) \cup \{p\}$ (meaning that
    the value of $\pi(x_i)$ is assigned to the register $x_i \in \mathcal{X}(l')$). □

We require register automata to be completely specified in the sense that whenever
there is an $\alpha$-transition from some location $l \in L$, then the disjunction of
the guards on $\alpha$-transitions from $l$ is true.

Let us now describe the semantics of an RA. A state of an RA $\mathcal{A} = (L, l_0, \mathcal{X}, \Gamma, \lambda)$
is a pair $(l, \nu)$ where $l \in L$ and $\nu$ is a valuation over $\mathcal{X}(l)$, i.e., a mapping from
$\mathcal{X}(l)$ to $\mathcal{D}$. The state is initial if $l = l_0$. A step of $\mathcal{A}$, denoted
$(l, \nu) \xrightarrow{\alpha(d)} (l', \nu')$,
transfers $\mathcal{A}$ from $(l, \nu)$ to $(l', \nu')$ on input of the data symbol $\alpha(d)$ if there is a
transition $(l, \alpha(p), g, \pi, l') \in \Gamma$ with

1. $\nu \models g[d/p]$, i.e., $d$ satisfies the guard $g$ under the valuation $\nu$, and
2. $\nu'$ is the updated valuation with $\nu'(x_i) = \nu(x_j)$ if $\pi(x_i) = x_j$, and
   $\nu'(x_i) = d$ if $\pi(x_i) = p$.

A run of $\mathcal{A}$ over a data word $w = \alpha_1(d_1) \ldots \alpha_n(d_n)$ is a sequence of steps

$(l_0, \nu_0) \xrightarrow{\alpha_1(d_1)} (l_1, \nu_1) \ldots (l_{n-1}, \nu_{n-1}) \xrightarrow{\alpha_n(d_n)} (l_n, \nu_n)$

for some initial valuation $\nu_0$. The run is accepting if $\lambda(l_n) = +$ and rejecting if
$\lambda(l_n) = -$. The word $w$ is accepted (rejected) by $\mathcal{A}$ under $\nu_0$ if $\mathcal{A}$ has an accepting
(rejecting) run over $w$ which starts in $(l_0, \nu_0)$. Note that an RA defined as above
does not necessarily have runs over all data words.
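
The step and run semantics can be sketched as a small interpreter. The Python code below is an illustration, not part of the paper: the tuple encoding of transitions, the function `run`, and the example automaton `ECHO` are assumptions, guards are given as Python predicates over the parameter and the current valuation, and the interpreter takes the first enabled transition, which is adequate only when guards from a location are mutually exclusive.

```python
from typing import Callable, Dict, List, Optional, Tuple

Valuation = Dict[str, int]
# (source, action, guard(p, valuation), assignment {register: source register or "p"}, target)
Transition = Tuple[str, str, Callable[[int, Valuation], bool], Dict[str, str], str]


def run(transitions: List[Transition], accepting: Dict[str, bool], initial: str,
        word: List[Tuple[str, int]]) -> Optional[bool]:
    """Return the verdict of the run over word, or None if some step has no transition."""
    loc: str = initial
    val: Valuation = {}
    for action, d in word:
        for src, act, guard, assign, dst in transitions:
            if src == loc and act == action and guard(d, val):
                # Parallel update: every right-hand side reads the old valuation.
                val = {x: (d if rhs == "p" else val[rhs]) for x, rhs in assign.items()}
                loc = dst
                break
        else:
            return None  # no enabled transition: the automaton has no run over word
    return accepting[loc]


# Hypothetical example: msg(p) stores p in x1, and ack(p) is accepted only if p == x1.
ECHO: List[Transition] = [
    ("l0", "msg", lambda p, v: True, {"x1": "p"}, "l1"),
    ("l1", "ack", lambda p, v: p == v["x1"], {}, "l0"),
    ("l1", "ack", lambda p, v: p != v["x1"], {}, "sink"),
]
ECHO_ACCEPTING = {"l0": True, "l1": False, "sink": False}
```

With these definitions, `run(ECHO, ECHO_ACCEPTING, "l0", [("msg", 4), ("ack", 4)])` yields an accepting verdict, while a word starting with ack has no run at all, which illustrates the closing remark of the paragraph above.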

We define a simple register automaton (SRA) to be an RA with no registers 
in the initial location, whose runs over a given data word are either all accepting 
or all rejecting. We use SRAs as acceptors for data languages. 

3 Tree Queries 

In this section, we first define symbolic decision trees (SDTs), which are used to 
symbolically describe a fragment of a data language. We then state conditions 
for the construction of SDTs, which is done by a tree oracle. 

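Symbolic decision trees. A symbolic decision tree (SDT) is an RA $\mathcal{T} =
(L, l_0, \mathcal{X}, \Gamma, \lambda)$ where $L$ and $\Gamma$ form a tree rooted at $l_0$. In general, an SDT
has registers in the initial location; we use $\mathcal{X}(\mathcal{T})$ to denote these registers $\mathcal{X}(l_0)$.
Thus, an SDT has well-defined semantics only with respect to a given valuation of $\mathcal{X}(\mathcal{T})$.

If $l$ is a location of $\mathcal{T}$, let $\mathcal{T}[l]$ denote the subtree of $\mathcal{T}$ rooted at $l$. Let $\mathcal{T}$
and $\mathcal{T}'$ be two SDTs, such that $\gamma : \mathcal{X}(\mathcal{T}) \mapsto \mathcal{X}(\mathcal{T}')$ is a bijection from the initial
registers of $\mathcal{T}$ to the initial registers of $\mathcal{T}'$. We say that $\mathcal{T}$ and $\mathcal{T}'$ are equivalent
under $\gamma$, denoted $\mathcal{T} \simeq_\gamma \mathcal{T}'$, if $\gamma$ can be extended to a bijection from all registers
of $\mathcal{T}$ to all registers of $\mathcal{T}'$, under which $\mathcal{T}$ and $\mathcal{T}'$ are isomorphic.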
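
Equivalence under a renaming can be illustrated by a brute-force search for the bijection. The Python sketch below makes simplifying assumptions that go beyond the text: trees are nested tuples, guards are sets of (operator, register) atoms, every register mentioned in a guard is treated as an initial register (so extending the bijection to registers introduced inside the tree is not modelled), and the names `registers`, `rename`, and `find_renaming` are hypothetical. Trying all permutations is exponential in the number of registers and is only meant to show the definition at work.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Optional, Set, Tuple

Guard = FrozenSet[Tuple[str, str]]  # atoms (operator, register), e.g. ("==", "x1")
# A node is (accepting, branches); branches maps (action, guard) to a child node.
Node = Tuple[bool, dict]


def registers(tree: Node) -> Set[str]:
    """Collect all registers mentioned in guards of the tree."""
    regs: Set[str] = set()
    for (_, guard), child in tree[1].items():
        regs |= {reg for _, reg in guard} | registers(child)
    return regs


def rename(tree: Node, gamma: Dict[str, str]) -> Node:
    """Apply the register renaming gamma to every guard of the tree."""
    accepting, branches = tree
    return (accepting, {
        (action, frozenset((op, gamma[reg]) for op, reg in guard)): rename(child, gamma)
        for (action, guard), child in branches.items()
    })


def find_renaming(t1: Node, t2: Node) -> Optional[Dict[str, str]]:
    """Search for a bijection gamma on registers with rename(t1, gamma) == t2."""
    r1, r2 = sorted(registers(t1)), sorted(registers(t2))
    if len(r1) != len(r2):
        return None
    for image in permutations(r2):
        gamma = dict(zip(r1, image))
        if rename(t1, gamma) == t2:
            return gamma
    return None


def ack_tree(reg: str) -> Node:
    """One-level tree: ack(p) is accepted iff p equals the given register."""
    return (False, {
        ("ack", frozenset({("==", reg)})): (True, {}),
        ("ack", frozenset({("!=", reg)})): (False, {}),
    })
```

For trees shaped like those in Fig. 2, `find_renaming(ack_tree("x1"), ack_tree("x3"))` returns the renaming that maps `x1` to `x3`.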

Let a symbolic suffix be a sequence of actions in $\Sigma^*$. Let $u$ be a data word
with $Vals(u) = d_1, \ldots, d_k$. Let $\nu_u$ be defined by $\nu_u(x_i) = d_i$. We require that
for each data word $u$ and each guard $g$ over $p$ and $Vals(u)$, the guard $g$ has
a representative data value in $\mathcal{D}$, denoted $d_u^g$, such that $\nu_u \models g[d_u^g/p]$ (i.e., $d_u^g$
satisfies $g$ after $u$), and such that whenever $g'$ is a stronger guard satisfied by $d_u^g$
(i.e., $\nu_u \models g'[d_u^g/p]$) then $d_u^{g'} = d_u^g$.

Definition 2. For a data language $\mathcal{L}$, a data word $u$ with $Vals(u) = d_1, \ldots, d_k$,
and a set $V$ of symbolic suffixes, a $(u, V)$-tree is an SDT $\mathcal{T}$ that has runs over
all data words $v$ with $Acts(v) \in V$, such that $v$ is accepted by $\mathcal{T}$ under $\nu_u$ iff
$uv \in \mathcal{L}$ (and rejected iff $uv \notin \mathcal{L}$) whenever $Acts(v) \in V$. Moreover, in any run of
$\mathcal{T}$ over a data word $v$, the register $x_i$ may contain only the $i$th data
value in $uv$. □

The last requirement simplifies the matching of decision trees. It can be enforced,
e.g., by requiring that whenever $(l, \alpha(p), g, \pi, l')$ is the $j$th transition on some
path from $l_0$, then for each $x_i \in \mathcal{X}(l')$ we have either (i) $i < k + j$ and $\pi(x_i) = x_i$,
or (ii) $i = k + j$ and $\pi(x_i) = p$ (recall that $k$ is the length of $u$).

The initial $\alpha$-transitions of an SDT are the transitions for action $\alpha$ from
the root location $l_0$, guarded by initial $\alpha$-guards. The SDT in Fig. 2 [a] has two
initial ack(p)-transitions with initial ack(p)-guards $p = x_1$ and $p \neq x_1$.

Tree oracles. A key concept in our approach is that of tree queries. Tree queries
are made to a tree oracle, which returns an SDT. To ensure the consistency of
tree queries, a tree oracle must satisfy the conditions in the following definition.

Fig. 3: [a] SDT for msg(p) after prefixes $\epsilon$ and msg(1)ack(1). Refined SDTs for
suffix msg(p)ack(p) after [b] $\epsilon$ and [c] msg(1)ack(1).


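Definition 3. Let $\mathcal{L}$ be a data language. A tree oracle for $\mathcal{L}$ is a function $\mathcal{O}_{\mathcal{L}}$,
which for a data word $u$ and a set $V$ of symbolic suffixes returns a $(u, V)$-tree
$\mathcal{T}$, and satisfies the following constraints.

1. If $V \subseteq V'$, then $\mathcal{O}_{\mathcal{L}}(u, V') \simeq_\gamma \mathcal{O}_{\mathcal{L}}(u', V')$ implies
   $\mathcal{O}_{\mathcal{L}}(u, V) \simeq_\gamma \mathcal{O}_{\mathcal{L}}(u', V)$ for all $u, u'$ and $\gamma$ (i.e., adding more
   symbolic suffixes cannot make inequivalent trees equivalent).

2. If $V \subseteq V'$, then for each initial $\alpha$-transition of $\mathcal{O}_{\mathcal{L}}(u, V)$ with guard $g$,
   there is some initial $\alpha$-transition of $\mathcal{O}_{\mathcal{L}}(u, V')$ with a stronger guard $g'$ (i.e.,
   $\nu_u \models g' \rightarrow g$).

3. If $(l_0, \alpha(p), g, \pi, l)$ is an initial transition of $\mathcal{O}_{\mathcal{L}}(u, V)$, then
   $\mathcal{O}_{\mathcal{L}}(u, V)[l] \simeq_\gamma \mathcal{O}_{\mathcal{L}}(u\alpha(d), \alpha^{-1}V)$, where $d = d_u^g$, and $\gamma$ is the
   identity mapping (i.e., any subtree of $\mathcal{O}_{\mathcal{L}}(u, V)$ must be isomorphic to the
   subtree after $d$; here $\alpha^{-1}V$ denotes the set of sequences $\alpha_1 \cdots \alpha_n$ such
   that $\alpha\alpha_1 \cdots \alpha_n \in V$). □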

The first two conditions in Def. 3 ensure monotonicity: First, extending V
will only preserve or introduce inequivalence between trees of different prefixes.
Second, by gradually extending V, we will only refine trees and not, e.g., merge
transitions or forget registers. Fig. 3 [b] and [c] show SDTs that refine SDT [a].
SDT [b] refines [a] by adding an assignment $x_1 := p$ to the initial transition and
by adding new transitions after the initial one. SDT [c] refines [a] by splitting 
the initial transition into two transitions with refined guards, and by initializing 
a register in the root location. The third condition ensures that it is sufficient 
to consider concrete prefixes with representative data values during learning. 

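Finally, let two data words $u$ and $u'$ be equivalent, denoted by $u \equiv_{\mathcal{O}_{\mathcal{L}}} u'$, if
$\mathcal{O}_{\mathcal{L}}(u, V) \simeq_\gamma \mathcal{O}_{\mathcal{L}}(u', V)$ for some $\gamma$ and any finite $V$. A data language $\mathcal{L}$ is regular
if $\equiv_{\mathcal{O}_{\mathcal{L}}}$ has finite index. The regularity of $\mathcal{L}$ is relative to the implementation of
tree queries, since $\equiv_{\mathcal{O}_{\mathcal{L}}}$ is defined on SDTs.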

The following adaptation of the Myhill/Nerode theorem provides the basis 
for convergence of the automata learning algorithm presented in the next section. 

Theorem 1 (Myhill-Nerode). Let $\mathcal{L}$ be a data language, and let $\mathcal{O}_{\mathcal{L}}$ be a
tree oracle for $\mathcal{L}$. If the equivalence $\equiv_{\mathcal{O}_{\mathcal{L}}}$ has finite index, then there is an SRA
which accepts precisely the language $\mathcal{L}$. □

Fig. 4: Hypothesis [a] (without error location $l_2$) and its observation table (right).
Transitions [b] for suffix msg(p)ack(p) after prefix msg(1)ack(1) in hypothesis.


4 The SL* Algorithm 


This section presents the central ideas for an active automata learning algorithm 
SL* (Symbolic L*, reminiscent of the L* algorithm). To construct an SRA for
some unknown data language, we need to infer locations, transitions, and reg- 
isters. Locations of an SRA can be characterized by their SDTs, which are 
obtained by making tree queries. Data words with equivalent SDTs will lead to 
the same location. The initial transitions of the SDTs will serve as transitions 
in the SRA. The registers of an SDT will become registers in the location that 
the SDT represents. A hypothesis automaton is constructed and submitted for 
an equivalence query. If it matches (which will happen eventually for regular data 
languages), the algorithm terminates. Otherwise, the returned counterexample 
is processed, leading to refinement of the hypothesis. 

The SL* algorithm maintains an observation table $(U, V, Z)$, where $U$ is a
prefix-closed set of data words, called short prefixes, $V$ is a set of symbolic
suffixes, and $Z$ maps each element $u$ in $U$ to its $(u, V)$-tree. The algorithm also
maintains a finite set $U^+$ of extended prefixes of the form $u\alpha(d)$ (abbreviated
$u\alpha$), such that $u \in U$ and $d$ is $d_u^g$, where $g$ is an initial $\alpha$-guard of $Z(u)$. Fig. 4
(right) shows an observation table for the example in Sec. 1. A set of symbolic
suffixes $V$ labels the column; rows are labeled with short prefixes from $U$ (above
the double line) and with prefixes from $U^+$ (below the double line). Each table
cell (referred to by row label $u$ and column label $V$) stores the SDT $Z(u)$.

Algorithm 1 shows a pseudocode description of SL*. The algorithm is initialized
(line 1) with $U$ containing the empty word, the set of symbolic suffixes $V$
being the empty sequence together with the set of all actions, and $Z(\epsilon)$ being the
SDT $\mathcal{O}_{\mathcal{L}}(\epsilon, V)$. The algorithm then iterates three phases: hypothesis construction,
hypothesis validation, and counterexample processing until no more counterexamples
are found, monotonically adding locations and transitions to hypothesis
automata. We detail these phases below, referring to lines in Algorithm 1.


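Algorithm 1 SL*

Require: A set $\Sigma$ of actions, a data language $\mathcal{L}$, a tree oracle $\mathcal{O}_{\mathcal{L}}$ for $\mathcal{L}$.
Ensure: An SRA $\mathcal{H}$ with $\mathcal{L}(\mathcal{H}) = \mathcal{L}$.

 1: $U \leftarrow \{\epsilon\}$, $V \leftarrow (\{\epsilon\} \cup \Sigma)$, $Z(\epsilon) \leftarrow \mathcal{O}_{\mathcal{L}}(\epsilon, V)$    ▷ Initialization
 2: loop
 3:   repeat    ▷ Hypothesis construction
 4:     $U^+ \leftarrow \{u\alpha(d_u^g) : u \in U, \alpha \in \Sigma$, and $g$ initial $\alpha$-guard of $Z(u)\}$
 5:     For each $u \in (U \cup U^+)$, $Z(u) \leftarrow \mathcal{O}_{\mathcal{L}}(u, V)$
 6:     if $\exists u \in U^+$ s.t. $Z(u) \not\simeq_\gamma Z(u')$ for any $\gamma$ and $u' \in U$ then
 7:       $U \leftarrow U \cup \{u\}$
 8:     if $\exists u\alpha \in U^+$ and $\exists x_i \in \mathcal{X}(Z(u\alpha)) \cap Vals(u)$ s.t. $x_i \notin \mathcal{X}(Z(u))$ then
 9:       $V \leftarrow V \cup \{\alpha v\}$ for $v \in V$ with $x_i \in \mathcal{X}(\mathcal{O}_{\mathcal{L}}(u\alpha, \{v\}))$
10:   until $(U, V, Z)$ is closed and register-consistent
11:   $\mathcal{H} \leftarrow Hyp((U, V, Z))$
12:   if $eq(\mathcal{H})$ then return $\mathcal{H}$    ▷ Hypothesis validation
13:   else    ▷ Counterexample processing
14:     for $(u_{i-1}, \alpha_i(p), g_i, \pi_i, u_i)$ in run of $\mathcal{H}$ over $\sigma$ do
15:       if $g_i$ does not refine an initial trans. of $\mathcal{O}_{\mathcal{L}}(u_{i-1}, V_{i-1})$ then $V \leftarrow V \cup V_{i-1}$
16:       if $\mathcal{O}_{\mathcal{L}}(u_{i-1}\alpha_i, V_i) \not\simeq_\gamma \mathcal{O}_{\mathcal{L}}(u_i, V_i)$ for $\gamma$ used to construct $\mathcal{H}$ then
17:         $V \leftarrow V \cup V_i$
18: end loop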


Hypothesis construction (lines 3-11). In this phase, the algorithm attempts 
to construct a hypothesis automaton by making tree queries and entering the 
results in an observation table. The answer to a tree query for the prefix $u$ and
the set of symbolic suffixes $V$ is the SDT $\mathcal{O}_{\mathcal{L}}(u, V)$, stored in the table as $Z(u)$.
An observation table $(U, V, Z)$ is

- closed, if for every $u \in U^+$ there is a short prefix $u' \in U$ and a $\gamma$ such that
  $Z(u) \simeq_\gamma Z(u')$. Closedness ensures that all transitions in the automaton
  have a target location. If the table is not closed, then $u$ leads to a location
  not covered by $U$, and $Z(u)$ proves it by not being equivalent to $Z(u')$ for
  any short prefix $u'$. We close $(U, V, Z)$ by making $u$ a short prefix, i.e., adding
  it to $U$.

- register-consistent, if $(\mathcal{X}(Z(u\alpha)) \cap Vals(u)) \subseteq \mathcal{X}(Z(u))$ for every $u\alpha \in U^+$.
  Register-consistency ensures that whenever a data value in $u$ is needed to
  construct the SDT after $u\alpha$, then it also occurs in the tree after $u$. If the
  table is not register-consistent, then $Z(u\alpha)$ has a register that expects a value
  from $u$ but $Z(u)$ does not have a register for storing this value. We make
  $(U, V, Z)$ register-consistent by extending $V$ with the appropriate abstract
  word $\alpha v$ with $v \in V$, propagating the missing register backwards to $Z(u)$.
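
The two table properties can be sketched as simple checks. The Python code below rests on assumptions that are not in the paper: each table row is summarized by a hypothetical `Row` record whose `shape` field is a canonical form of the SDT up to register renaming (so that comparing shapes stands in for searching a suitable bijection), registers are identified by their index i (with x_i holding the i-th data value of the prefix), and the function names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

Prefix = Tuple[Tuple[str, int], ...]  # a data word, e.g. (("msg", 1), ("ack", 1))


@dataclass(frozen=True)
class Row:
    shape: str                 # hypothetical canonical form of Z(u) up to renaming
    registers: FrozenSet[int]  # indices i of the registers x_i at the root of Z(u)


def unclosed(short: Dict[Prefix, Row], extended: Dict[Prefix, Row]) -> List[Prefix]:
    """Extended prefixes whose SDT matches no short prefix; they must be added to U."""
    shapes = {row.shape for row in short.values()}
    return [ua for ua, row in extended.items() if row.shape not in shapes]


def register_inconsistent(short: Dict[Prefix, Row],
                          extended: Dict[Prefix, Row]) -> List[Tuple[Prefix, int]]:
    """Pairs (ua, i): Z(ua) uses x_i, which holds a value of u, but Z(u) lacks x_i."""
    issues = []
    for ua, row in extended.items():
        u = ua[:-1]
        if u not in short:
            continue
        for i in sorted(row.registers):
            if i <= len(u) and i not in short[u].registers:
                issues.append((ua, i))
    return issues


# Small example with made-up shapes S0, S1, S2.
EPS: Prefix = ()
M1: Prefix = (("msg", 1),)
M1M2: Prefix = (("msg", 1), ("msg", 2))
short_rows = {EPS: Row("S0", frozenset()), M1: Row("S1", frozenset())}
extended_rows = {M1M2: Row("S2", frozenset({1, 2}))}
```

In this made-up table, `M1M2` is reported by both checks: its shape matches no short prefix, and its tree uses `x1`, a value of the prefix `M1`, for which the row of `M1` has no register.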

A closed and register-consistent observation table $(U, V, Z)$ can be used together
with a set $U^+$ of extended prefixes to construct a hypothesis automaton
$Hyp((U, V, Z)) = (L, l_0, \mathcal{X}, \Gamma, \lambda)$, where

- $L = U$ and $l_0 = \epsilon$,
- $\mathcal{X}$ maps each location $u \in U$ to $\mathcal{X}(Z(u))$ ($\mathcal{X}(l_0)$ is the empty set),
- $\lambda(u) = +$ if $u \in \mathcal{L}$, otherwise $\lambda(u) = -$, and
- each $u\alpha \in (U \cup U^+)$ with corresponding initial $\alpha$-transition $(l_0, \alpha(p), g, \pi, l')$
  of $Z(u)$ generates a transition $(u, \alpha(p), g, \pi', u')$ in $\Gamma$, where
  • $u'$ is the (unique) prefix in $U$ with $Z(u\alpha) \simeq_\gamma Z(u')$,
  • $\pi'$ is an assignment $\mathcal{X}(Z(u')) \mapsto (\mathcal{X}(Z(u)) \cup \{p\})$. For $x_i \in \mathcal{X}(Z(u'))$,
    we define $\pi'(x_i) = \gamma^{-1}(x_i)$ if $\gamma^{-1}(x_i)$ stores a data value of $u$ in $Z(u\alpha)$,
    and $\pi'(x_i) = p$ otherwise.

Fig. 4 shows an observation table that is closed and register-consistent. Fig. 4 [a] 
shows the hypothesis that can be constructed from it. In the table, rows for short 
prefixes (above the double line) are annotated with corresponding locations in 
the hypothesis. The assignment on the transition from $l_0$ to $l_1$ and the guard on
the transition from $l_1$ to $l_0$ are both derived from the SDT for prefix msg(1).

Hypothesis validation (line 12). The hypothesis automaton $\mathcal{H}$ is submitted
for an equivalence query. The teacher either replies 'OK', or returns a counterexample
(a word that is accepted by $\mathcal{H}$ but rejected by the target system, or vice
versa). If it replies 'OK', the algorithm terminates and returns $\mathcal{H}$. Otherwise,
the counterexample has to be analyzed.

Counterexample analysis (lines 13-17). A counterexample indicates either
that a location is missing (i.e., that $U$ has to be extended), or that a transition
is missing (i.e., that SDTs need to be refined), or that we used an incorrect
renaming $\gamma$ between some SDTs when constructing the hypothesis. For a
counterexample $\sigma$ of length $m$ we denote by $\sigma_i$ its prefix of length $i$, and by $v_i$ its
suffix of length $m - i$. Moreover, let $V_i$ be the singleton set $\{Acts(v_i)\}$.

In a run of $\mathcal{H}$ over $\sigma$, the $i$-th step $(u_{i-1}, \nu_{i-1}) \xrightarrow{\alpha_i(d_i)} (u_i, \nu_i)$ traverses
transition $(u_{i-1}, \alpha_i(p), g_i, \pi_i, u_i)$, i.e., prefix $\sigma_i$ leads to the location corresponding to
short prefix $u_i$ from $U$. In order to determine at which step the run of $\mathcal{H}$ over $\sigma$
diverges from the behavior of the system under learning, we analyze the sequence
$u_0 = \epsilon, \ldots, u_m$ and the corresponding $(u_i, V_i)$-trees for $0 \le i \le m$ computed by
$\mathcal{O}_{\mathcal{L}}(u_i, V_i)$, using an argument similar to the one presented in [21]: Since $\sigma$ is
a counterexample and $V$ contains $\epsilon$, there is an index $j$ of the counterexample
for which $u_{j-1}$ together with $\mathcal{O}_{\mathcal{L}}(u_{j-1}, V_{j-1})$ contains a counterexample to $\mathcal{H}$,
while $u_j$ and $\mathcal{O}_{\mathcal{L}}(u_j, V_j)$ do not. We can then distinguish two cases.
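
The existence argument for the index j amounts to scanning for the first position at which a predicate flips from true to false. The Python sketch below shows only this scan; the predicate `still_wrong`, assumed to report whether the i-th short prefix together with its tree still contains a counterexample to the hypothesis, is hypothetical and would in practice require a tree query per index. The function name `first_breakpoint` is likewise an assumption.

```python
from typing import Callable


def first_breakpoint(still_wrong: Callable[[int], bool], m: int) -> int:
    """Return the smallest j in 1..m with still_wrong(j - 1) and not still_wrong(j)."""
    previous = still_wrong(0)
    for j in range(1, m + 1):
        current = still_wrong(j)  # one evaluation (tree query) per index
        if previous and not current:
            return j
        previous = current
    raise ValueError("no index j with still_wrong(j - 1) and not still_wrong(j)")
```

If the predicate holds at index 0 and fails at index m, as the argument above states for a counterexample, the scan always returns some j and evaluates the predicate at most m + 1 times.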

Case 1. The guard $g_j$ in the step of $\mathcal{H}$ from $u_{j-1}$ to $u_j$ does not refine an
initial transition of $\mathcal{O}_{\mathcal{L}}(u_{j-1}, V_{j-1})$. In this case the SDT distinguishes cases
that $\mathcal{H}$ does not distinguish. Adding $V_{j-1}$ to $V$ will result in new and refined
transitions from $u_{j-1}$ in the hypothesis. This is guaranteed by the monotonicity
requirement on tree oracles in Def. 3. Consider, e.g., the counterexample
msg(1)ack(1)msg(1)ack(1) to the hypothesis in Fig. 4 at index 3. The hypothesis
in Fig. 4 [b] has only one transition with guard true after msg(1)ack(1). The
corresponding SDT for $\mathcal{L}_{seq}$ (Fig. 3 [c]), on the other hand, has two initial
transitions, and neither of them is refined by the guard true. Adding msg(p)ack(p) to
$V$ will add these transitions to the hypothesis.


Case 2. The tree $\mathcal{O}_{\mathcal{L}}(u_j, V_j)$ is not isomorphic to the corresponding subtree
after $\alpha_j(d_{u_{j-1}}^{g_j})$ of $\mathcal{O}_{\mathcal{L}}(u_{j-1}, V_{j-1})$ under the renaming of registers $\gamma$ that was
used in the hypothesis (only one of these trees contains a counterexample to
$\mathcal{H}$). Adding $V_j$ to $V$ will lead to either $\mathcal{O}_{\mathcal{L}}(u_j, V) \not\simeq_\gamma \mathcal{O}_{\mathcal{L}}(u_{j-1}\alpha_j(d_{u_{j-1}}^{g_j}), V)$, so that
$u_{j-1}\alpha_j(d_{u_{j-1}}^{g_j})$ will become a separate location, or $\gamma$ will be refined. Consider again
the counterexample msg(1)ack(1)msg(1)ack(1) to the hypothesis in Fig. 4; this
time at index 2. Here, $u_{j-1}\alpha_j(d_{u_{j-1}}^{g_j})$ is msg(1)ack(1), and $u_j = u_2$ is $\epsilon$. The
SDTs for these two prefixes and the suffix msg(p)ack(p) are shown in Fig. 3 [b]
and Fig. 3 [c]. They are not equivalent. Adding the suffix msg(p)ack(p) to $V$
will lead to a new location for msg(1)ack(1) in the next hypothesis.

Correctness and termination. That SL* returns a correct SRA upon termination
follows from the properties of our teacher. For regular data languages,
termination follows from the properties of tree queries in Sec. 3, from Theorem 1,
and from the algorithm itself: SDTs will only be refined when adding symbolic
suffixes, and this can happen only finitely often. Each added symbolic suffix will
lead to a new transition, a refined transition, a new register assignment, or
a new location. By adapting arguments from other contexts [5, 16], Theorem 1
can be used to show that SL* converges to a minimal (in terms of locations and
registers) SRA for $\mathcal{L}$. Note that this minimal number of locations and transitions
also depends on the particular tree oracle that is used.

Complexity. We estimate the worst case number of counterexamples and show
how they lead to a correct model with $n$ locations, $t$ transitions, and at most $r$
registers per location. Since each location has one access sequence, $n < t$, and
thus we estimate the costs in $t$ and $r$ only. The final model is minimal relative
to the implementation of tree queries: it has one location per class of $\equiv_{O_{\mathcal{L}}}$. Each
counterexample results in one additional suffix in the observation table, leading
to a new transition or to discarding a bijection between two prefixes in $U$. The
former can happen $t$ times before all transitions are identified. The latter can
happen at most $tr$ times, since it corresponds to breaking a symmetry between
two of at most $r$ registers at one of $n < t$ locations (cf. [14]). The algorithm
terminates after $O(tr)$ equivalence queries. The number of tree queries depends
on the length $m$ of the longest counterexample and on the size of the observation
table. The algorithm uses a maximum of $m$ calls per counterexample, and the
size of $U \cup U^{+}$ in the final observation table is $t + 1$. This leads to $O(t^2 r + trm)$
tree queries and yields the following theorem.

Theorem 2. The algorithm SL* infers a data language $\mathcal{L}$ with $O(tr)$ equivalence
queries and $O(t^2 r + trm)$ tree queries. □
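
The bound can be retraced as a rough count that uses only the quantities named in the complexity argument. The symbols $E$ (total number of equivalence queries) and $T$ (total number of tree queries) are introduced here for illustration only, and the simplification $t + tr = O(tr)$ assumes $r \geq 1$.

```latex
\begin{align*}
E &\le \underbrace{t}_{\text{new transitions}}
     + \underbrace{tr}_{\text{discarded bijections}}
   = O(tr), \\
T &\le \underbrace{E \cdot m}_{\text{counterexample analysis}}
     + \underbrace{E \cdot (t + 1)}_{\text{one new suffix for each prefix in } U \cup U^{+}}
   = O(trm) + O(t^{2} r)
   = O(t^{2} r + trm).
\end{align*}
```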

5 Implementation and Evaluation 

We have implemented the SL* algorithm together with a teacher for a black-box
scenario and a fixed set of relations on integers and rationals. We allow equalities
and/or inequalities as well as simple sums of registers and pre-defined constants
(e.g., $p = x_1 + x_2$ or $p = x_1 + 5$).



Fig. 5: Connection establishment of TCP (only non-reflexive transitions). 


The implementation of tree queries $O_{\mathcal{L}}(u, V)$ is based on the ideas for constructing
canonical constraint decision trees presented in [9] (Proof of Theorem 1). The
set of $\mathcal{R}$-distinguishable classes of data words of the form $uv$ where $Acts(v) \in V$
can be represented in an SDT with maximally refined guards (so-called atoms).
We use an SMT solver (Z3; see footnote 4) to generate tests for all atoms in this SDT. Finally,
atoms are merged in a bottom-up fashion based on test results.
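
The following is a minimal sketch of the test-and-merge idea for a single suffix parameter p compared against stored register values with < and =. It is not the authors' implementation: it handles only one level of an SDT, picks representative values by hand instead of calling an SMT solver, and merges only neighbouring atoms. The names `atoms`, `merge_atoms`, and `accepts` are hypothetical.

```python
from fractions import Fraction


def atoms(registers):
    """Maximally refined guards for one parameter p over the relations < and =.

    Returns (label, representative) pairs. The representative stands in for the
    test value that an SMT solver would generate for the atom (assumption).
    """
    xs = sorted(set(registers))
    if not xs:
        return [("true", Fraction(0))]
    result = [(f"p < {xs[0]}", Fraction(xs[0]) - 1)]
    for i, x in enumerate(xs):
        result.append((f"p = {x}", Fraction(x)))
        if i + 1 < len(xs):
            midpoint = (Fraction(x) + Fraction(xs[i + 1])) / 2
            result.append((f"{x} < p < {xs[i + 1]}", midpoint))
    result.append((f"p > {xs[-1]}", Fraction(xs[-1]) + 1))
    return result


def merge_atoms(registers, accepts):
    """Run one test per atom, then merge neighbouring atoms with equal outcomes."""
    merged = []  # list of (labels, outcome) with labels a list of atom labels
    for label, value in atoms(registers):
        outcome = accepts(value)
        if merged and merged[-1][1] == outcome:
            merged[-1][0].append(label)
        else:
            merged.append(([label], outcome))
    return [(" or ".join(labels), outcome) for labels, outcome in merged]


# Example: registers hold 3 and 7; the (hypothetical) system accepts p >= 7.
for guard, outcome in merge_atoms([3, 7], lambda p: p >= 7):
    print(guard, "->", "accept" if outcome else "reject")
```

With two registers there are five atoms, and the test outcomes collapse them into two guards. Richer relations such as sums of registers would add further atoms, which is the source of the growth discussed in the evaluation.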

Equivalence queries have been implemented using tree queries (similar to the
approach in [13]). We generate $O_{\mathcal{L}}(\epsilon, w)$ for all $w \in \Sigma^k$ up to some depth $k$ and
compare the SDTs to the hypothesis. We start with $k = 3$ and increase $k$ until
a fixed time limit is reached (10 minutes) or until a counterexample is found.
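
The depth-bounded search can be sketched as follows. This is a simplification under stated assumptions, not the authors' code: instead of comparing SDTs, it compares membership of concrete words over a small finite sample of data values, and the names `find_counterexample`, `system`, `hypothesis`, and `max_depth` are hypothetical. The default time limit of 600 seconds corresponds to the 10 minutes mentioned above.

```python
import itertools
import time


def find_counterexample(system, hypothesis, actions, values,
                        start_depth=3, max_depth=None, time_limit=600.0):
    """Search for a word on which system and hypothesis disagree.

    Returns (word, k) if a disagreement is found at depth k, otherwise
    (None, k_done), where k_done is the greatest fully explored depth.
    """
    deadline = time.monotonic() + time_limit
    symbols = [(a, d) for a in actions for d in values]
    k = start_depth
    while max_depth is None or k <= max_depth:
        for word in itertools.product(symbols, repeat=k):
            if time.monotonic() > deadline:
                return None, k - 1  # depth k was not completed
            # Compare every prefix so that shorter words are covered as well.
            for i in range(k + 1):
                prefix = word[:i]
                if system(prefix) != hypothesis(prefix):
                    return prefix, k
        k += 1
    return None, k - 1


# Hypothetical target: accept only words whose data values are all equal.
def target(word):
    return len({d for _, d in word}) <= 1


# Hypothetical (wrong) hypothesis: accept everything.
cex, depth = find_counterexample(target, lambda word: True, ["msg"], [1, 2])
print(cex, depth)
```

The two returned depths play the roles of the depth at which a counterexample was found and the greatest explored depth; the latter only bounds correctness up to that depth.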

We have inferred a simplified version of the connection establishment phase 
of TCP, a bounded priority queue from the Java Class Library, and a set of five 
smaller models (Alternating-bit protocol, Sequence number, Timeout, an ATM, 
and a Fibonacci counter). Here, we only detail the TCP model. Fig. 5 shows the 
connection establishment phase of TCP. The example uses a set of five actions: 
init, syn, syn-ack, ack, and fin-ack. The transition init(p) was added to get
an initial sequence number. Each synchronizing message increases this number; 
all other messages use the current sequence number. 

We used common optimizations for saving tests: a cache and a prefix-closure
filter. Table 1 shows the results. We report the locations, variables, and
transitions for all inferred models. For each case, we state the number of
constants, relations (< denotes the combination of equalities and inequalities),
and supported terms: p + c indicates sums of parameters and constants, and
p + p sums of different parameters. We report the number of tree queries (TQs)
and equivalence queries (EQs) made. For equivalence queries, we also state the
depth $k_1$ at which the last counterexample was found and the greatest explored
depth $k_2$ (up to which inferred models are guaranteed to be correct). Finally,
we show execution times.
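
As an illustration of the two test-saving optimizations, the sketch below wraps a test runner with a cache and a prefix-closure filter. It rests on an assumption that the text does not spell out: the target language is prefix-closed, so every extension of a rejected word is rejected as well. The class `FilteredOracle` and its members are hypothetical names.

```python
class FilteredOracle:
    """Wraps a test runner with a cache and a prefix-closure filter."""

    def __init__(self, run_test):
        self.run_test = run_test  # executes one test on the system under learning
        self.cache = {}           # word (tuple) -> bool
        self.tests_run = 0        # number of tests actually executed

    def accepts(self, word):
        word = tuple(word)
        # Cache: never run the same test twice.
        if word in self.cache:
            return self.cache[word]
        # Prefix-closure filter (assumes a prefix-closed language):
        # an extension of a known rejected word is rejected without testing.
        for i in range(len(word)):
            if self.cache.get(word[:i]) is False:
                self.cache[word] = False
                return False
        self.tests_run += 1
        result = self.run_test(word)
        self.cache[word] = result
        return result


# Hypothetical prefix-closed system: accept words of positive numbers only.
oracle = FilteredOracle(lambda word: all(x > 0 for x in word))
oracle.accepts((1, -1))      # runs a test, rejected
oracle.accepts((1, -1, 5))   # answered by the filter
oracle.accepts((1, -1))      # answered by the cache
oracle.accepts((1, 2))       # runs a test, accepted
print(oracle.tests_run)
```

Only two of the four queries reach the system in this example; the other two are answered from stored results.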

Time consumption for learning is below one second for most of the examples;
the only “real” Java class, the priority queue, takes a little more time (4.3
seconds). The difference between $k_1$ and $k_2$ gives an idea of how likely the final
hypothesis is to be correct: if $k_2$ is bigger than $k_1$, then the depth was increased by
$k_2 - k_1$ without finding a new counterexample. A big difference suggests that
the learning algorithm has converged to the correct RA. For some examples
no counterexamples were found, and for the Timeout example $k_2 = \infty$, i.e.,
the equivalence query terminated successfully. This was possible because no
sequence of length greater than two is in the language of this example. For

4 http://z3.codeplex.com



                                         Language class         Queries   EQ        Times
Model             Loc's  Var's  Trans's  Const's  Rel's  Op's   TQs  EQs  k_1  k_2  TQs [s]  EQs [s]
ABP                   3      0        5        2      =     -     9    1    -   11      0.1    599.9
Sequence Number       3      1        4        1      =   p+c     8    1    -   10      0.1    599.9
TCP                   7      1       51        1      =   p+c   187    2    6    7      0.6    599.4
PriorityQueue         8      2       33        0      <     -   113    5    6    7      4.3    595.7
Timeout               4      1        5        1      <   p+c     9    1    -    ∞      0.2      0.1
ATM                   3      1        7        3      <   p+c    16    2    3    4      1.3    598.7
Fibonacci counter     4      2        6        0      <   p+p    19    2    3    5      0.2    599.8

Table 1: Experimental results obtained on a 2GHz Intel Core i7 with 8GB of 
memory running Linux kernel 3.8.0. 

the examples with more relations (<, and p + c or p + p) the reached depth
$k_2$ is smaller, regardless of the number of locations and transitions in the final
model. This is due to the exploding number of $\mathcal{R}$-distinguishable classes of data
words in such cases. One way of addressing this challenge in the future could be
introducing typed parameters and using multiple simpler disjoint domains.

6 Conclusions 

We have presented a symbolic learning algorithm which can be parameterized by
methods for constructing symbolic decision trees and which infers models that
capture both control and data aspects of a system. Our preliminary implementation
demonstrates that the approach can infer protocols comprising sequence
numbers, time stamps, and variables that are manipulated using simple arithmetic
operations or compared for inequality even in a black-box scenario.

A particularly promising direction for future research will be the combination
with white-box methods like symbolic execution, both for searching for
counterexamples and for supporting the construction of decision trees. We also plan to
investigate decidability of tree queries and equivalence queries in our learning
model for different data domains.
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